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SCALABLE BAYESIAN INFERENCE FOR 
EXCITATORY POINT PROCESS NETWORKS 


By Scott W. Linderman and Ryan P. Adams 
Harvard University 

Networks capture our intuition about relationships in the world. 
They describe the friendships between Facebook users, interactions 
in financial markets, and synapses connecting neurons in the brain. 
These networks are richly structured with cliques of friends, sectors of 
stocks, and a smorgasbord of cell types that govern how neurons con¬ 
nect. Some networks, like social network friendships, can be directly 
observed, but in many cases we only have an indirect view of the net¬ 
work through the actions of its constituents and an understanding of 
how the network mediates that activity. In this work, we focus on the 
problem of latent network discovery in the case where the observable 
activity takes the form of a mutually-excitatory point process, also 
known as a Hawkes process. We build on previous work that has taken 
a Bayesian approach to this problem, specifying prior distributions 
over the latent network structure and a likelihood of observed activ¬ 
ity given this network. We extend this work by proposing a discrete¬ 
time formulation and developing a computationally efficient stochas¬ 
tic variational inference (SVI) algorithm that allows us to scale the 
approach to long sequences of observations. We demonstrate our al¬ 
gorithm on the calcium imaging data used in the Chalearn neural 
connectomics challenge. 


1. Introduction. Networks are abstractions of the relationships and connections between real- 
world objects, such as people, stocks, or neurons. These connections reflect relationships like “Wil¬ 
son and Brady are friends,” or “When neuron A fires it excites neuron B.” Sometimes the networks 
themselves are observed data, as in the case of social network friendships, but often our view of 
the network is indirect. We are left to infer the latent connections between objects based on our 
observations of their behavior. In our neural example, recording techniques can often provide a 
measure of the neurons’ activity but cannot resolve the individual synaptic connections between 
neurons. Given our knowledge of how synapses work, however, we might infer that if one neuron 
consistently fires shortly after another then there is likely an excitatory connection between them. 
This is one example of the latent network discovery problem that this work addresses. 

We focus on the case where our observations come in the form of a series of discrete events, like a 
sequence of Twitter messages or the hring pattern of a population of neurons. These events do not 
happen independently; rather, events induce subsequent events according to an excitatory network 
of interactions. A connection from one object to another indicates that events by the first object 
increase the probability of subsequent events by the second. We model these observations with a 
multivariate Hawkes process, a type of point process tailored to excitatory networks of interaction. 

Building on the previous work, we combine the Hawkes process observation model with a prior 
distribution over networks in a unified Bayesian model (Simma and Jordan, 2010; Blundell et ah, 
2012; Perry and Wolfe, 2013; DuBois et ah, 2013; Linderman and Adams, 2014; Guo et ah, 2015). 
Most real-world networks are not simply random, but have highly structured patterns of interaction. 
For example, a friendship can often be traced back to some commonality between two people, such 
as belonging to the same club or attending the same school. A simple model for social networks 
might assign each person to a group, and then connect people according to whether or not they are 
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Fig 1: Components of the Network Hawkes process model, (a) Events induce weighted impulse responses 
on downstream processes, spawning more events according to the Hawkes process dynamics, (b) Underlying 
these dynamics is a weighted, directed network. Here, the network is a stochastic block model with two 
clusters of processes (red and blue). Processes primarily connect to others of the same type, though there 
is a small probability of connection from red to blue, (c) We model the impulse responses as a convex 
combination of normalized basis functions. 


in the same group. This is known as a stochastic block model (SBM) (Nowicki and Snijders, 2001), 
and is one example of a random network model that may serve as prior probability distribution 
over networks in a Bayesian framework. 

We improve upon previous work by providing a discrete-time analogue of the Hawkes process 
that is considerably more efficient on datasets with high rates of activity, and by devising an efficient 
stochastic variational inference (SVI) algorithm that can scale to long sequences of observations. 
Most previous work has relied upon Markov chain Monte Carlo (MCMC) methods for inference, 
which must consider the entire observation sequence when evaluating the likelihood of a state, and 
which can be prone to poor convergence. SVI provides an alternative method of inference that 
can work with mini-batches (small subsets) of observations per iteration, and has been shown to 
yield dramatic improvements in a variety of large-scale machine learning problems. Code for the 
models and algorithms developed in this paper is available at https://github.com/slinderman/ 
pyhawkes. 

2. Related Work. Hawkes processes (Hawkes, 1971) are multivariate point processes that 
relate excitatory interaction networks to sets of discrete events. For example, suppose we have a 
Hawkes process with two constituent processes. An event on the hrst process could add an impulse 
response to the future rate of the second process. More generally, a multivariate Hawkes process 
consists of K individual point processes with rates {Afc(t)}|^^ that depend upon the events that 
have occurred up to time t. This dependence on preceding events distinguishes the Hawkes process 
from the Poisson process. Given the history of events up to time t, however, Hawkes processes have 
conditionally Poisson dynamics. 

Hawkes processes have additive, excitatory interactions. Each event adds a nonnegative impulse 
response to the future rate of connected processes. The rate of the A:-th process is 

K Nk 

(1) v(«) = a/) + ll['Sfc,n ^ ■ hk'^kit 'Sfcjn); 

k'=l n=l 

where {sk^n}n=i ^ [0,T]^'' is the set of event times for events on process k, is the total number 
of events on process k, is the “background rate” of the /c-th process, and hy^ki^t) is an 
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impulse response function describing the amplitude of influence that events on process k’ have on 
the rate of process k at time lag At. 

The Hawkes process is closely related to the generalized linear model (GLM) with Poisson obser¬ 
vations, which is widely used in computational neuroscience (Paninski, 2004; Pillow et ah, 2008). 
In fact, the Hawkes process is a special case of the Poisson GLM in which the link function is linear 
and the impulse responses are non-negative. As in the Poisson GLM, the negative log likelihood 
of the Hawkes process is convex, enabling efficient maximum likelihood and maximum a poste¬ 
riori estimation. The advantage of the Hawkes process is that its linear form allows for elegant 
fully-Bayesian inference — a task which is non-trivial in the Poisson GLM due to the lack of con- 
jugacy. With these Bayesian inference algorithms, we can estimate and reason using the posterior 
uncertainty of the model. 

One of the earliest applications of Hawkes processes in machine learning was the work of Simma 
and Jordan (2010), which developed an expectation-maximization algorithm based upon the aux¬ 
iliary variable parent formulation. Observing that Xk{t) is a sum of impulse responses, Simma and 
Jordan (2010) invoked the Poisson superposition theorem to motivate a set of auxiliary “parent” 
variables, Zk,n, which denote the origin of the n-th event on process k, either the background rate 
or an impulse response from a previous event. Gonditioned upon these auxiliary variables, the 
likelihood factorizes over impulse responses. 

Subsequent works leveraged this intuition to extend Hawkes processes with interpretable prior 
distributions over the network of impulse responses. Blundell et al. (2012) introduced an infinite 
relational model prior over the network of interactions as well as a Gibbs sampling algorithm for 
fully Bayesian inference. DuBois et al. (2013) explored the use of infinite relational models as a prior 
in conjunction with a point process observation model and a Gibbs sampling inference algorithm. 
Recently, Linderman and Adams (2014) developed a general framework for combining random 
“spike-and-slab” network models with Hawkes processes that uses a continuous time formulation 
and an auxiliary Gibbs sampling inference algorithm. Guo et al. (2015) have developed a similar 
model that applies Hawkes processes to language modeling problems and incorporates features of 
the discrete events. 

We extend the work of Linderman and Adams (2014) by addressing two shortcomings: the limited 
scalability of the continuous time formulation which introduces auxiliary variables for each event 
in the dataset, and the batch nature of their Gibbs sampling algorithm. We address the former by 
deriving a discrete time version of their model which vastly outperforms the continuous time version 
on datasets with high rates of activity. To overcome the batch nature of the Gibbs algorithm, we 
make an approximation to the spike-and-slab network model that renders the model fully conjugate, 
thereby enabling efficient stochastic variational inference (Hoffman et ah, 2013) on mini-batches of 
data. 

3. The Discrete Time Network Hawkes Model. The fundamental limitation of the previ¬ 
ously developed continuous time models is that the number of values that the auxiliary variable Zk^n 
can take grows with the number of events which occurred before time Sk^n- For datasets with high 
rates of activity, this can quickly become the limiting factor of the inference algorithm. At the 
same time, it is often reasonable to assume that events do not interact on time scales faster than 
some At. This motivates a discrete time formulation in which we bin events in bins of width At 
and ignore potential interactions between events in the same bin. Then the rate becomes, 

K t-i 

Xt,k — Xf. -|- ^ ^ ^ ^ St '■ hk'^k[t ^ ]) 

fc'=lt'=l 


(2) 
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where s G is the matrix of event counts and hki_^k[t — t'] is an impulse response function 

describing the amplitude of influence that events on process k' have on the rate of process k at 
discrete time lag t — t'. As we will show, under this formulation the auxiliary variables only assume 
a fixed set of values independent of the rate. 

In order to incorporate the network model as a prior distribution for the Hawkes process, we 
follow the approach of Linderman and Adams (2014) and decompose the impulse response function 
into the product of a scalar weight that specifies the strength of the interaction and a probability 
mass function that specifies the time course of interaction: 


B 

hk^k'[d] = Wk^k'Gk^k'[d\ = 

b=l 

for d G {1,..., H}. Here, W G is a non-negative weight matrix drawn from a spike-and-slab 

prior. 


^k^k' ~ Bern(Afc^fc/ \ pk^k'), 


Wk^k' ~ 


Gimi{Wk^k' I K,Vk^k') 

SoiWk^k') 


if Ak^k' = 1, 
if Ak^k' = 0. 


The network model provides pk^k', the probability of a directed connection from node k to k', 
and Vk^k'j the inverse scale of the gamma distribution over the corresponding connection weight. 
We assume that k, the shape of the weight prior, is fixed for simplicity. The matrix A = 
is a binary, directed adjacency matrix indicating the presence or absence of a connection for each 
pair of nodes, and the matrix W = {{Wk^k'}} is a non-negative weight matrix denoting the 
strength of each connection. 

The vector Gk^k' is a normalized probability mass function. We model Gk^k'\<d\ as a con¬ 
vex combination of basis functions, 0^, which are normalized such that (f)h[di\^t = 1, 

and require 9b^’^ ^ ^ under our model. This constraint is implemented via a Dirichlet 

prior, ~ Dir( 7 ). 

Intuitively, the weight Wk^k' specifies how many child events on process k' will be caused by a 
single event on process k. Then, Gk^k'[d] specifies the probability that the child event will occur 
at lag dAt. In fact, this procedure is the basis of a recursive algorithm for generating samples from 
the discrete time Hawkes process. 

Figure 1 illustrates the basic components of the model. In this example, we have a stochastic 
block model with two types of processes (red and blue) that preferentially interact with processes 
of the same type. Events induce weighted impulse responses on downstream processes according to 
an underlying latent network (Fig. Ib). The impulse responses, Gk^k'[d], are modeled as a convex 
combination of basis functions (Fig Ic). The impulse response is weighted according to the strength 
of the connection, Wk^y before being added to the rate of downstream processes. 

With fixed basis functions, we can write the instantaneous discrete time in a simplified form. 


( 3 ) 


( 4 ) 


\k = A® + X] X] ^ sti^k'(i>b[t -1'] 

k'=l b=l t'=l 

k'=l b=l 


where = {^■.,k' * 4^b)[A be precomputed. Here, the instantaneous rate reduces to a sum of 
a weighted inputs, which suggests a Bayesian inference algorithm via data augmentation. 
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Fig 2: Comparison of run time per Gibbs sweep for the discrete and continuous network Hawkes formulations. 
Best fit lines added. 


4. Inference with Gibbs Sampling. As in other preceding work (Simma and Jordan, 2010), 
we begin by introducing auxiliary parent variables for each entry st^k- By the superposition theorem 
for Poisson processes, each event can be attributed to either the background rate or one of the 
impulse responses. 

Let G {0,..., St^k'} denote how many of the events that occurred in the t-th time bin on 

the k'-th. process are attributed to the 6-th basis function of the A:-th process. Similarly, let zf^, 
denote the number of events attributed to the background process. We combine these auxiliary 
variables into vectors, Zt^k' — • • ■ > • 

Due to the Poisson superposition principle, these parent variables are conditionally multinomial 
distributed. For time t and process k', we resample 


zt^k' ~ Mult {st,k',ut,k') 


- - 
%k' - X 


( 0 ) 

k' 


Am _ 
A,k' — 


{k,k')^ 

^^k^k' 9h ^t,k,b 


’'‘t,k' ’ A,k' 

Given this attribution, the likelihood factorizes into a product of Poisson distributions. 


p{z I A) = 


K 


n n I 


u=l fc'=i 


T K K 


n n n i Wk^k'ot’'^ ^st,k,b^t) 


{k,k')' 


U=1 k=l k'=l 


Gibbs sampling the background rates.. We use conjugate priors for the constant background rates, 
weights, and impulse responses. For the constant background rates we have, Gam.{ax,/3\), 

which results in the conditional distribution 


4'^ I {4fc} ~ Gam(af\^f^), 


(fc) I (0) 

«A + 

i=l 


= Px + TAt. 


Gibbs sampling impulse responses.. The likelihood of the impulse responses, jg proportional 

to a Dirichlet distribution. Combined with a Dirichlet( 7 ) prior this yields 


I ^ Dirichlet ^ 


(k,k') (k,b) 

% =A + 2^zlk'- 


(kb) 


t=l 
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Gibbs sampling the weighted adjacency matrix.. Given the adjacency matrix A and the parents, 
the weights follow a gamma distribution, 

T B T 


Wk- 


>k' 


Ak^k' = 1 ~ Gam(K 


{k,k') 


,ik,k')^^ 




+EE 


yik,b) 

■'t,k' ’ 


= Vk- 




"fc' “1“ / ^ Stjfc 


t=l b=l 


t=l 


Following Linderman and Adams (2014), to resample A, we iterate over each entry and sample 
from the marginal distribution after integrating out the parents. We assume the parameters of the 
network prior can be sampled efficiently — a reasonable assumption for many random network 
models. 

The continuous time representation introduces a latent “parent” variable for each event in the 
dataset, and the parent can be any one of the events that occurred in the preceding window of 
influence. Call the number of potential parents M. The discrete time representation has a multi¬ 
nomial random variable for each time bin that contains at least one event, and the support of this 
multinomial is always a fixed size, KB -|- 1. When the rate of events is high, KB -|-1 <C M, allowing 
for dramatic improvements in efficiency in the discrete case. 

Figure 2 shows the time per full Gibbs sweep as a function of the number of events per discrete 
time bin for the discrete and continuous formulations. The discrete formulation incurs a constant 
penalty whereas the continuous formulation quickly grows with the event rate. For low rates, the 
continuous formulation can be advantageous, but the discrete model is vastly superior in many 
realistic settings. For example, Linderman and Adams (2014) worked with trades on the S&PlOO, 
which occur tens or hundreds of times per second for each stock. Since the complexity of their 
algorithm grows with the number of events, they down-sampled the data to consider only the times 
when a stock price significantly changed. 


5. Stochastic Variational Inference. The discrete time formulation offers advantageous 
complexity compared to the continuous analogue, but it still must resample the entire set of parents 
each iteration in order to maintain the invariance of the posterior distribution. In many cases, a 
mini-batch of time bins can provide substantial information about the global parameters of the 
model, and rapid progress can be made by iterating quickly over subsets of the data. This motivates 
our derivation of a stochastic variational inference (SVI) algorithm for the network Hawkes process. 

Variational methods optimize a lower bound on the marginal likelihood by minimizing the KL- 
divergence between a tractable approximating distribution and the true posterior. Since the data- 
local variables (e.g., the parent identities) are conditionally independent given the global parame¬ 
ters (W, g, etc.), our variational approach will easily extend to the stochastic setting in which we 
compute unbiased estimates of the gradient of the variational objective using mini-batches of data. 

The primary impediment to deriving a variational approximation is the non-conjugacy of the 
spike-and-slab prior on the weights. To overcome this, we approximate the spike-and-slab prior 
with a mixture of gamma distributions, as has previously explored by Grabska-Barwinska et al. 
(2013): 


p{Ak^k') = Bern(Afc^fc/ \pk^k’), 


p{Wk^k' I Ak^k') 


Gam(lFfc^fe/ | K,Vk^k’) 
Gam(lFfc_^fe/ | kq , vq ) 


if Ak^k' = 1, 
if Ak^k' = 0. 


As Ko 0 and vq ^ oo, the second mixture component converges to a delta function at zero and 
recovers the true spike and slab model. As we relax this prior, the weights will be nonnegative 
when A = 0, but they will be small relative to the weights when A = 1. Importantly, with this 
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prior the model is rendered conjugate and amenable to a matching variational factor for each 
pair Following Lazaro-Gredilla and Titsias (2011), let, 


(5) 

( 6 ) 


q{Ak^k’) = Bevn{Ak^k' \Pk^k'), 


Q{Wk^k' I 


('TTA \~{k,k') ~(k,k')\ 

Gam(lFfc^A:'I) 

nur \ ~(k,k') ~(k,k')\ 

Gam(W4^fc'|«o ) 


if Ak^h’ = 1, 
if Ak^k’ = 0. 


The rest of the variational approximation is fully factorized. Since the model is now conjugate, 
factors are easily derived. We provide a complete derivation in Appendix B and state the final 
forms here. 

Variational updates for parent variables, q{zt). For the parent variables, the variational updates 
are 


(7) q{zt^k') = I st,k',ut,k') 

(8) I® A [In I 

(9) oc st,fc,bexp + Ew[lnWk,k']'j ■ 

Variational updates for background rates, q{X^^'^). The variational form parameters of the gamma 
distribution over background rates are 

qi^'k^) = Gam(Ai,°^ | = ax v'^E^ 

t=i 

Variational approximation for impulse response parameters, q{g^^’^''>). With the conjugate prior 
formulation the variational parameter updates for the Dirichlet distributed impulse response pa¬ 
rameters are 


,( 0 ) 

''t,k 


/ 3 f =/ 3 A + rAt. 





T 

lb + lEz 
t=l 




(k,b) 

t,k' 


Variational approximation for the weighted adjacency matrix.. The primary motivation for adopt¬ 
ing a weakly sparse mixture of gamma distributions is to derive an efficient variational inference 
algorithm. The mixture-of-gammas prior is conjugate with the Poisson observations, and hence the 
variational distribution is also a mixture of gammas: 




q{Wk^k' I Ak^k' = 1) = Gam(lFfc^fc/ | 

T B T 

{k,b)] ror i , 

'Ik' V\ ’ = E[Vk^k'\ + 2^ St,k , 


K + 


t=l 6=1 


t=l 


and likewise for the “spike” factor, 

q{Wk^k' I Ak^k' = 0) = Gam{Wk^k' \ 


lk,k') 


T B 


— ^0 + 


EE^K' 


{k,b) 

k' 


t=l 6=1 


lk,k') 


k'O + ■ 


t=l 
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Fig 3: Predictive log likelihood versus wall clock time for three Bayesian inference algorithms on a dataset 
of AT = 50 processes and T = 10^ and T = 10® time bins on the left and right, respectively. 


This leaves us with which is Bernoulli distributed with parameter Collecting 

all the terms that include and lack Wk^k' yields 

Pk^k' _ exp{E[lnfffc^fc']} (exp{E[lnufc^fc']})'" r(/to) ^ 

1-pk^k' exp{E[ln(l^ r(K) ^ ^ ^ 

As with Gibbs sampling, we assume a variational approximation for the network model can be 
derived, and provide access to the necessary expectations, E[lnpfc^fc/], E[ln(l — pk^k')], 
and E[lnufc_^fc/]. 

As aforementioned, the time bins are conditionally independent given the network weights and 
the adjacency matrix — a common pattern exploited by stochastic variational inference (SVI) 
algorithms (Hoffman et ah, 2013). These methods optimize the variational objective using stochastic 
gradient methods that work with mini-batches of data. Often, a mini-batch of data can provide 
valuable information about the global parameters, in our case the network and background rates. 
Quickly iterating over these global parameters allows us to reach good modes of the posterior 
distribution in a fraction of the time that batch variational Bayes and Gibbs sampling require, 
since those methods must process the entire dataset before making an update. SVI does require 
some tuning, however. In particular, we must set a mini-batch size and a step size schedule. In this 
work, we fix the mini-batch size to Tmb = 1024 and set the step size at iteration i to {i + 

These parameters may be tuned with general purpose hyperparameter optimization techniques 
(Snoek et ah, 2012). 

6. Synthetic Results. We assess the performance of the proposed inference algorithms on 
a synthetic dataset generated by a strongly sparse, discrete time Hawkes process with K = 50 
processes. The network is an Erdds-Renyi graph with uniform connection probability of p = 0.08, 
and the weights are independently and identically distributed with a Gam(3,15) prior. We simu¬ 
late T = 10® time bins in steps of size At = 1. The processes have a mean background rate of 1.0 
event per time bin and, due to the network interactions, the average total rate of the processes 
is 16.7 ± 12.0 events per bin. Referring to Figure 2, this is a regime that favors the discrete model. 
Then we trained the strongly sparse discrete time model using Gibbs sampling, and the weakly 
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Fig 4: Application of the network Hawkes model to a connectomics challenge. The data are in the form of a 
calcium fluorescence trace, which we preprocess to extract neural spike times (a). We measure performance 
on a link prediction task using a precision-recall curve and find that the posterior estimates of SVI provide 
the best estimates on some networks (b). In addition to an estimate of the connection probability and weight, 
SVI provides an estimate of the posterior uncertainty, (c) shows Eq[A]/stdq[A] for the first 20 neurons. True 
connections are outlined in black. 


sparse discrete time model using either batch variational Bayesian inference (VB) or stochastic 
variational inference (SVI) 

We evaluated the algorithms in terms of their predictive log likelihood on a held-out dataset 
of length Ttest = 10^. First, we trained the models on only the hrst T = 10^ time bins of data. 
Figure 3 (left) shows the predictive log likelihood as a function of wall-clock time, measured in 
units of nats per event improvement over a homogeneous Poisson process baseline. We initialized 
the Gibbs sampling and variational inference algorithms by rounding the cross-validated MAP 
estimate as described in the appendix. For this relatively short dataset, SVI and batch VB converge 
at comparable rates, achieving competitive predictive log likelihood in a matter of minutes. Gibbs 
sampling for the strongly sparse model converges at a considerably slower rate, but eventually 
outperforms the variational results from the weakly sparse model. The MAP estimate, even with 
cross validated regularization, underperforms the other competing algorithms. 

This trend is amplified when we consider the entire training set of size T = 10^. Figure 3 (right) 
illustrates the power of SVI in handling these large time datasets. Considerable information about 
the global parameters (e.g., the network) can be gained from just a mini-batch of time points. 
Hence, we can make rapid improvements in predictive log likelihood very quickly. By contrast, 
each step of the Gibbs and batch VB algorithms is approximately 10 times slower, and even after 
computing sufficient statistics over the entire dataset, the algorithm is only able to make limited 
progress per iteration. 

7. Connectomics Resnlts. We tested these inference algorithms on the data from the 
Chalearn neural connectomics challenge^ (Stetter et ah, 2012). The data consist of calcium flu¬ 
orescence traces, F, from six networks oi K = 100 neurons each. We use ten minutes of data at 
50Hz sampling frequency to yield T = 3 x 10® entries in S. In this case, the networks are purely 
excitatory, and each action potential, or spike, increases the probability of the downstream neurons 
firing as a result. This matches the underlying intuition of the Hawkes process model, making it a 

^http://connectomics.chalearn.org 



















10 


S. W. LINDERMAN AND R. R ADAMS 



Network I 

Network 2 

Network 3 

Network 4 

Network 5 

Algorithm 

ROC 

PRC 

ROC 

PRC 

ROC 

PRC 

ROC 

PRC 

ROC 

PRC 

xcorr 

0.596 

0.139 

0.591 

0.133 

0.701 

0.198 

0.745 

0.296 

0.798 

0.359 

MAP 

0.607 

0.174 

0.619 

0.143 

0.698 

0.178 

0.790 

0.334 

0.859 

0.408 

SVI 

0.649 

0.184 

0.605 

0.141 

0.673 

0.176 

0.774 

0.342 

0.844 

0.410 


Table 1 

Comparison of inference algorithms on link prediction for five networks from the Chalearn connectomics challenge. 
Performance is measured by area under the ROC curve and area under the precision recall curve (PRC). In four of 
the five networks a Hawkes process model provides the best results. 


natural choice. 

In order to apply the Hawkes model, we first convert the fluorescence traces into a spike count 
matrix using OOPSI, a Bayesian inference algorithm based on a model of calcium fluorescence 
(Vogelstein et ah, 2010). The output is a filtered fluorescence trace, F, and a probability of spike 
for each time bin. We threshold this at probability 0.7 to get aT x K binary spike matrix, s. This 
preprocessing is shown in Figure 4a. 

Figure 4b shows the precision-recall curve we used to evaluate the algorithms’ performance on 
network recovery. As a baseline, we compare against simple thresholding of the cross correlation 
matrix. On Network 6, SVI offers the best network inference. Table 5 shows the results on the other 
five networks using the same model parameters. On 4/5 of these networks, the Bayesian methods 
offer the best performance. 

Figure 4c illustrates one of the main advantages of the fully Bayesian inference algorithm - 
calibrated estimates of posterior uncertainty. Here we show the SVI algorithm’s estimate of the 
posterior mean of A normalized by the posterior standard deviation for a subset of 20 neurons 
from Network 6. We also outline the true connections to show that the most confident predictions 
are more likely to correspond to true connections. Such estimates of the posterior uncertainty are 
not available with standard heuristic methods or point estimates. 

8. Conclusion. We presented a scalable stochastic variational inference algorithm for the prob¬ 
lem of Bayesian network discovery with Hawkes process observations. Building on previous modeling 
work, we leveraged a weak sparsity model to obtain a fully conjugate model. We focused on scaling 
to long duration recordings (large T). Scaling to large networks (large K) is nontrivial due to de¬ 
pendencies among weights, but in future work we hope to explore approximate algorithms to tackle 
this important problem. 
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APPENDIX A: DERIVATION OE THE GIBBS SAMPLING ALGORITHM 

The updates for the weights and the background rates are straightforward extensions of those 
presented in (Linderman and Adams, 2014). The only nontrivial derivation of the Gibbs sampling 
is the update for the impulse response parameters. The likelihood of the impulse responses, ) 
is proportional to a Dirichlet distribution. This can be seen by observing that the Hawkes pro¬ 
cess can be sampled by recursing over events. Eor each event we sample the number of offspring 
from Poiss(IEfc_,.fc/), and then we sample the offsets of those offspring from Gk^k’[d]. Since G is 
modeled as a convex combination of basis functions, the choice of basis function is essentially a 
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draw from a categorical distribution. We can also derive this directly from the likelihood: 

Pi{{4’!i^^}t=i}b=i I '^) 0^ n n Poisson I ^st,k,b^t 


T B 


t=lb=l 


oc 
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n n ht,k,b^4j X exp ^-Wk,k'g^4'^ 
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exp -Wk,k'gb^ y~lsi,fc. 
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exp -Wk^k'Nk^g^ 
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exp 
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OC Dirichlet ( 

Li=l 1=1 

Combined with a Dirichlet( 7 ) prior this yields, 

I Dirichlet (V), 

T 

, , {k,b) 

lb = lb + 2_^zl^y’. 

1=1 


APPENDIX B: DERIVATION OE THE VARIATIONAL INEERENCE ALGORITHM 

Our goal is to approximate the posterior distribution, p{z, A, W, 0net | s), with a varia¬ 

tional distribution, q{z, A, W, X^^\g, Onet), by minimizing the KL-divergence between q and p. As 
before, we have introduced auxiliary parent variables, 2 , to decouple the events attributed to each 
potential parent process. We then restrict q to take a factorized form, 

T K K K 

(10) q{z, A, W, fif, 0net) = n ^(^*) n n n 9{Ak,k', Wk,k')q{g^’'’’"'^)q{0net) 

1=1 k=l k=lk'=l 

This is essentially the same as a typical factorized approximation except that we have combined Ak^k' 
and Wk^k' into a single factor, as in Lazaro-Gredilla and Titsias (2011). This allows a multimodal 
spike-and-slab approximating distribution. 

With this factorized approximation each individual factor must satisfy a set of mean field con¬ 
sistency equations in which the log of each factor is equal (up to a constant) to the expectation of 
the log posterior under the remaining variational factors. 
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Variational approximation for parent variables, q{zt). For the parent variables, the consistency 
equations imply, 

In q{zf^,) = E;^o \np{zf^,,A, W, | s) + const. 

= E;^o lnp{zj:^l I + const. 

=In A® + const., 

and 

lnq{zl’^jj’^) =EA,w,g A,W, g \ s) + const. 

= E^,vF,g I A, W, gf I s) + const. 

= - In - Ew,A,g Wk,k'gf'^ ^ St,k,b + ,A,g In I VFfc,fc/gf+ const. 

Combined with the constraint that z^y + Ylkb^^k’'^ ~ St^k': this is the form of a multinomial 
distribution with variational parameter Ut^k'-, 

(11) q{zt,k') = Multinomial(zt,fc/ | st,k',Ut,k'), 

= ^exp{E[lnAf]} 

ufy'’ = ^St,fc,bexp |E[lng^^’^ ^]| exp {E[lnlFfc,fc']} 

K B 

Z = exp|E[lnA^°^]| + ^ ^ St,fc,b exp |E[lng[^’^ ^]| exp {E[lnlFfc,fc/]} . 

k=l b=l 

Variational approximation for impulse response parameters, q{g^^’^''i). The consistency equations 
yield, 

lng(g{*:A')) = [lnp{zt^k,W,g\ s)] + const. 

= ^ 1^76 + z^'^y'^ \ Ingf^ + const. 

6=1 V t=i ’ / 

With the conjugate prior formulation the variational approximation is again a Dirichlet, 

(12) g(g(*'’^')) = Dirichlet(7^^’^')) ^ ^ z^’^y^ . 

t=i 

Variational approximation for constant background rates, q{X^^^). The variational form for ^(A^^^) 
is also determined by the conjugate model. We have, 

lng(A[°^)=E 2 lnp{zt^k,A,W,X^^\g\s) + const. 

T 

= ^ - A® At + Ez zf^l In A^°^ + (oa - 1) In A® - /3 aA® + const. 
t=i 

This is the form of a gamma distribution with variational parameters, 

= Gam(5^^\3^1^^) = q;a + ^Ez zfj. 

t=i 


(13) 


/3f ^ = /3 a + TAt. 
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Variational approximation for spike-and-slab weights, Wk^k')- Returning to the variational 

factor for 2 : in Equation 11, we see that a problem arises with the spike-and-slab model. If our model 
and our variational approximation are to share a spike-and-slab formulation then the expected log 
weight will be, 


EvF[lnII4,fc'] = E^Evi/| A[lnIEfc,fc/] = pE\[iiWk,k' \ Ak,k’ = 1] -h (1 - p) InO = - 00 . 

To avoid this degeneracy, we replace the delta function with a gamma distribution, p{Wk,k' \ Ak^k’ = 
0) = Gam(fi; 0 ) which converges to a delta function as kq —)• 0 and vq —)• 00 . The prior on weights 
is then a mixture of two gamma distributions, and is conjugate with the Poisson observations. This 
in turn implies a variational distribution that is also a mixture of gammas. We derive this with the 
consistency equations, 
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As expected, q{Wk,k' \ Ak,k') has the form of a gamma distribution. 
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This leaves us with which we take to be Bernoulli distributed with parameter Pk,k'- This 

implies, 


Ak,k' Inpfc^fc'+ lnGam(Kf’^ + (1 - ln(l - + lnGam(K[)^’^ \ = 

Ak,k' [Ee^^Jlnpfc^fc'] +IE0,,„[lnGam(K,Ufc^fc/)]] + (l-^fc,fc/) [E0,,Jln(l - pk^k')] + lnGam(Ko, uo)] • 

Collecting all the terms that include Ak^k' and lack Wk,k' yields, 


Pk^k' _ 

1 - Pk^k' 

exp{Eg„^Jlnpfc^fc']} ^ (exp{Eg„^Jlnufc^fc']})^ ^ ^ r(«;o) ^ ^ 

exp{E0„^Jln(l-pfc^fc/)]} T{k) {vo^o 

Variational updates for the network model. In the above derivations, the only requirement on the 
network model is that it provide the following expectations: K\pk^k'], IE[lnpfc^fc/], E[ln(l —pk^k')], 
E[ufc_^fc/], and E[lnufc^fc']. It may make sense to fix one of these values; for example, we may have a 
good empirical estimate of the weight scale, Vk^k’-, and therefore do not need to model its posterior 
distribution. Alternatively, we may know the overall sparsity but be uncertain of the scale. For some 
models, computing the necessary variational expectations and deriving the variational updates is 
straightforward. For example, a stochastic block model (Nowicki and Snijders, 2001) with a gamma 
prior on the scale is fully conjugate and admits closed form updates. 

Completing the variational expectations. Now we can compute the necessary expectations for the 
variational parameter updates. These are all available in closed form. 



E 



E[lnWk,k'] 


= -ln^f\ 

= Pk^k' - Inuf+ (1 - Pk^k') 
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where ^f{■) denotes the digamma function. To complete the variational updates for the background 
rate, impulse response, and weight parameters we have, 
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With this final expectation, the parameter updates for the variational weight distributions simplify 
to '^ = Vo + Nk and ^ = E[vk^k'] + Afk. 


B.l. Initialization. Variational inference algorithms can be quite sensitive to initialization of 
their model parameters. A poor initialization may converge to a suboptimal mode of the posterior 
distribution. In order to initialize our algorithm, we first fit a standard (log concave) Hawkes 
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process using MAP estimation on a subset of the data. We use an exponential prior on the weights, 
equivalent to LI regularization, and we tune the scale of the prior with cross validation. To covert 
the weights inferred under the standard Hawkes model to a sample of the network Hawkes model, 
we keep the largest p-fraction of the weights and set the remainder to zero. Finally, we initialize 
the variational parameters of the network Hawkes model such that they are peaked at the sample 
values. For example, we set ^ = 100 and ^ = 100 • 



